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ABSTRACT 

We propose a novel two-stage Gene Set Gibbs Sampling (GSGS) framework, to reverse 
engineer signaling pathways from gene sets inferred from molecular profiling data. We hy- 
pothesize that signaling pathways are structurally an ensemble of overlapping linear signal 
transduction events which we encode as Information Flow Gene Sets (IFGS's). We infer 
pathways from gene sets corresponding to these events subjected to a random permutation 
of genes within each set. In Stage I, we use a source separation algorithm to derive un- 
ordered and overlapping IFGS's from molecular profiling data, allowing cross talk among 
IFGS's. In Stage II, we develop a Gibbs sampling like algorithm. Gene Set Gibbs Sampler, 
to reconstruct signaling pathways from the latent IFGS's derived in Stage I. The novelty 
of this framework lies in the seamless integration of the two stages and the hypothesis 
of IFGS's as the basic building blocks for signal pathways. In the proof-of-concept stud- 
ies, our approach is shown to outperform the existing Bayesian network approaches using 
both continuous and discrete data generated from benchmark networks in the DREAM 
initiative. We perform a comprehensive sensitivity analysis to assess the robustness of the 
approach. Finally, we implement the GSGS framework to reconstruct signaling pathways 
in breast cancer cells. 
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1. Introduction 



A central goal of computational systems biology is to decipher signal transduction 
pathways in living cells. Characterization of complicated interaction patterns in signal- 
ing pathways can provide insights into biomolecular interaction and regulation mecha- 
nisms. Consequently, there have been a large body of computational efforts address- 
ing the probler a of signaling pathway reconstruction by using Probabilistic Boolean Net- 
works (PBNsl (jShmuleyich et q/.I 120021. IShmulevich ei^ a/.l 120031) . Bayesian Networks (BNs) 



flFrideman et al\ l200d. ISeeal et al\ l2003l. ISong et al\ |2009t) Relevance Networks (RNs) 



Butte and K ohane"20 03|), Graphical Gaussiari Models (GGMs) flKishino and WaddelfcOOOl . 



Dobra et a/.l 2004. iSchafer and Strimmeiil2005) and other approaches ([Gardner et al. 
Tenger e/a/.ll2003llAltav and Emmert-Streibll2"oioh . 
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Although the existing approaches are useful, they often represent a phenomenological 
graph of the observed data. For example, parent set of each gene in case of BNs, indicates 
statistically causal relationships. RNs, GGMs and PBNs are computationally tractable 
even for large signaling pathways, however co-expression criteria used in RNs and GGMs 
only models a possible functional relevancy, and the use of boolean functions in PBNs may 
lead to an oversimplification of the underlying gene regulatory mechanisms. Moreover, the 
aforementioned approaches purely rely on molecular profiling data generated from high- 
throughput platforms, which are often noisy with high experimental cost associated with 
them. Consequently, the reconstructed networks may fail to represent the underlying signal 
transduction mechanisms. 

On the other hand, gene set based analysis has received much attention in recent years. 
An initial characterization of large-scale molecular profiling data often results in the identi- 
fication of many pathway components, which we refer to as gene sets. Availability of several 
computational and experimental strategies have led to a rapid accumulation of gene sets in 
the biomedical databases. A gene set compendium is comprised of a large number of over- 
lapping gene sets as each gene may simultaneously participate in many biological processes. 
Overlapping reflects the interconnectedness among gene sets and should be exploited to in- 
fer the underlying gene regulatory network. Our motivation of considering a gene set based 
approach for network reconstruction falls into many other categories. For instance, a gene 
set based approach can more naturally incorporate higher order interaction mechanisms as 
opposed to individual genes. In comparison to molecular profiling data, gene sets are more 
robust to noise and facilitate data integration from multiple data acquisition platforms. A 
gene set based approach can allow us to explicitly consider signal transduction mechanisms 
underlying individual gene sets. Overall, gene sets provide a rich source of data to infer 
signaling pathways. The relative advantages of working with gene sets in bioinformatics 
analyses have been a d equately demonstrat ed ( ISubramanian et q/.I l2005l . iPang et a/.l 12006 , 
Pang and Zhaol 20081 Richards et al. 2010 ). However, signaling pathway reconstruction 
by sufficiently exploiting gene sets, a promising area of bioinformatics research, remains 
underdeveloped. 

With few exceptions, the existing net w'ork reconstruction approaches do not accommo- 
date gene sets. The frequency method in ( jRabbat et a/.ll2005l ) assigns an order to a gene set 
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by assuming a tree structure in the paths between pairs of nodes. However, the method 
is subjected to fail in the presence of multiple paths between the same pair of nodes. 
To capture the und erlying relations between nodes, the cGraph algorithm presented in 
( Kubica et al. 20031 ) adds weighted edges between each pair of nodes that appear in some 
gene set. The networks inferred by this approach often contain a large number of false 
positives. It is also difficult to incorporate prior knowledg e about regulat o r-target pairs 



in th e approaches mentioned above. The EM approach in (jZhu et adl2006l iRabbat et al. 



20081 ) treats permutations of genes in a gene set as missing data and assumes a linear ar- 



rangement of genes in each set. Nevertheless, it is necessary to develop a systems biology 
framework integrating both, identification of significant gene sets and signaling pathways 
reconstruction from gene sets. 

A central aspect of developing such network reconstruction frameworks is to understand 
the structure of signaling pathways. Signaling pathways are an ensemble of several over- 
lapping signaling transduction events with a linear arrangement of genes in each event. 
We denote these events as Information Flows (IF's). Information Flow Gene Sets (IFGS's) 
stand for the gene sets obtained by randomly permuting the order of genes in each IF. 
Thus, an IF and an IFGS share the same set of genes, however the latter lacks gene order- 
ing information or it is unordered. We hypothesize that IF's form the building blocks for 
signaling pathways and uniquely determine their structures. One plausible way to retrieve 
the latent, unordered and overlapping IFGS's from molecular profiling data is to use source 
separation approaches, such as Singular Value Decomposition (SVD) (Stage I). The true 
signaling pathways can be reconstructed by inferring a distribution of more likely orders 
of the genes in each IFGS (Stage II). 

In this paper, we design a two-stage Gene Set Gibbs Sampling (GSGS) framework by 
seamlessly integrating deconvolution of IFGS's and signaling pathway reconstruction from 
IFGS's. In Stage I, we infer unordered and overlapping IFGS's corresponding to the latent 
signal transduction events. In Stage II, we devel op a stochastic algor i thm Gene Set Gibbs 
Sampler under the Gibbs sampling framework (IGelman et al\ |2003| . iGivens and Hoeting 



20051 ) to reconstruct signal pathways from IFGS's inferred in Stage I. The algorithm treats 



the ordering of genes in an IFGS as a random variable, and samples signaling pathways 
from the joint distribution of IFGS's. The two-stage GSGS framework is novel from various 
aspects, such as the hypothesis of IFGS's as the basic building blocks for signal pathways, 
the definition of gene orderings as a random variable to accommodate higher-order inter- 
action as opposed to individual gene expression, and probabilistic network inferences. 

We comprehensively examine the performance of our approach by using two gold stan- 
dard networks from DREAM (Dialogue for Reverse Engineering Ass essments and Methods) 
initiative and compa re it with the B ayesian networ k approaches K2 (jCooper and Herskovits 
19921 . iMurphvl 1200 lb ) and MCMC flMurphvl l20dll a. iMurphvl l200lb ) . We also perform sen- 



sitivity analysis to access the robustness of the framework to the under-sampling and over- 
sampling of gene sets. Finally, we use our framework to reconstruct signaling pathways in 
breast cancer cells. 
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2. Methods 

2.1. Our concepts. An Information Flow (IF) is a directed linear path from one node to 
another node in signahng pathways which does not allow self transition or transition to a 
previously visited node. An Information Flow Gene Set (IFGS) is the set of all genes in 
an IF with a random permutation of their ordering. The length of an IFGS is the number 
of genes present in the set. Therefore, there are L\ putative information flows that are 
compatible with an IFGS of length L. We assume throughout that L > 3. An IF of length 
two serves as prior knowledge. Given a collection of m unordered IFGS's Xi, X2, . . . , X^, 
we treat the order 9j associated with Xi as a random variable. We write (Xj, 9j) to 
represent this association. Let us assume that the length of is Lj, for i = 1, . . . , m. As 
the sampling space of 0j corresponding to X^ is of size LJ, it follows that the sampling space 
of the joint distribution P((Xi, 61), . . . , (X^, 6™,)) is the set of Hili-^*' permutations. 
Sampling space of size HI^i can be computationally intractable even for moderate 
values of Li and m. As a result, our goal of signaling pathway reconstruction can be 
translated into drawing sample signaling pathways sequentially from the joint distribution 
P((Xi, 61), . . . , (Xm, ©m)) (the true signaling pathway) of IFGS's and then estimating the 
most likely signaling pathway using sampled pathways. 

Next, we present our two-stage GSGS framework. In Stage I, we derive IFGS's which 
form the building blocks of the signaling pathways. In Stage II, we develop a Gibbs 
sampling like algorithm to sequentially sample permutation orders for each IFGS by con- 
ditioning on the remaining of the network structures. 

2.2. Stage I: Derivation of IFGS's. In Stage I, we derive unordered and overlapping 
IFGS's corresponding to latent information flows to serve as input for the pathway re- 
construction algorithm presented in the next section (Fig. [1]). We use Singular Value 
Decomposition (SVD) to identify candidates gene sets. To extract coherent gene sets, the 
algorithm combines knowledge from two complementary forms of data, gene sets available 
from data bases and molecular profiling data from high-throughput platforms. We first 
select genes which appear most frequently in the gene set compendium under study. This 
frequency is referred to as degree. We identify significant genes by fitting a power law dis- 
tribution {y oc x""", a > 1) on the degrees of distinct genes present in the compendium. An 
application of SVD on the gene expression data D corresponding to significant genes leads 
to a factorization of the form Z^pxg = Upxp ■ Spxq ■ V^^^, where p is the number of genes and 
q is the number of samples. We choose column vectors from U corresponding to k highest 
singular values in SVD. In general, k is comparatively smaller than the original dimension 



of data. Following iKim and Tidorll2003l . we assume that k satisfies k{m + n) < mn. We let 
k = max{r : r(m + n) < mn} to derive the maximum number of gene sets by preserving 
the preceding criteria. It is well known that a single gene in a living cell may simulta- 
neously participate in multiple biological processes. The chosen basis vectors represent k 
potential information flows. For a specified cut-off /3, we set the top /3% entries (in absolute 
values) among k vectors as significant and other entries as zero. The non-zero locations in 
k vectors correspond to k overlapping gene sets. We further perform gene set enrichment 
analysis on the gene sets derived using SVD. The enriched gene sets represent IFGS's. 
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Figure 1 . Flow chart for the two-stage GSGS signahng pathway reconstruction 
framework. Stage I: Derivation of IFGS's using two common data resources. Stage 
II: Gene Set Gibbs Sampler successively draws sample signaling pathways of the 
underlying true signaling pathway from the joint distribution of IFGS's. 



2.3. Stage II: Signaling pathway reconstruction from IFGS's. Joint distribution 
and conditional distribution of gene sets. With increasing number of gene sets, the size of 
the sampling space for the multivariate distribution P((Xi, 9i), . . . , (X^, 6m)) is of the 
order of Hiii-^i'- Such a space might be computationally intractable even for moderate 
values of Lj and m. However, it is possible to theoretically describe this distribution under 
certain assumptions. 

Now onwards, we consider IFGS's as random samples from a first order Markov chain 
model, where the state of a node is only dependent on the state of its previous node. We 
compute the initial probability vector ttq and transition probability matrix 11 from m IF's 
(ordered paths) as follows. If there are a total of n distinct genes across m IF's, then 

vro = A...,^) (1) 

c c 

where q is the total number of times Z*'* gene appears as the first node among m IF's, for 
each / = 1, . . . , n and c = Ya=i ^i- ^rs is the total number of times r^^ gene transits to 
gth ggj^g (^i_g_ there is edge from r to s) among m ordered paths, then 

n = [Prs]nxn (2) 

where Prs = Crs/ XlLi ^rs, r, s = 1, . . . , n. 
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The computation of ttq and 11 allows us to calculate the likelihood of each of the YYiLi 
collections of IF's. The likelihood of each collection is the product of the likelihoods of m 
individual IF's in the collection. As each IF is treated as a first order Markov chain, we 
can calculate its likelihood using ttq and 11. For example, we compute the likelihood of the 
information flow z ^ y x 

V{z ^ x) = P{z) X P{y\z) X P{x\y). (3) 

The likelihood values calculated for a total of YULi ^t- collections of IF's can be normalized 
to denote a distribution of permutation ordering probabilities. However, the computation 
of niii-^i' likelihoods might be computational intractable. This serves as motivation for 
the proposed Gibbs sampling like approach. The computational tractability of our GSGS 
framework lies in sampling an order for each IFGS Xi by conditioning on the orders of the 
remaining IFGS's, with a much reduced sample space of size LJ. 

Let us write all IFGS's and their associated orderings together as (X,9), where X = 
(Xi, . . . , Xm) and 9 = (6i, . . . , 6^)- The notations are suffixed with — i to consider all 
but the i*^ component, e.g. (X, 9)_i etc., for i E {!,..., m}. We sample an order 

for the i^^ gene set Xj by conditioning on the known orders of remaining m — 1 gene sets 
Xi, . . . , Xi_i, Xj_|_i, . . . , Xm- To sample an order for Xj from the conditional distribution, 
we leave the i^^ gene set out, and compute the initial probability vector 7r_j and transition 
probability matrix n_j by following the procedure described in Eq. [T]and Eq. |2l from m — 1 
IF's. Further, we calculate the likelihoods of all possible orders 6^, j = 1, . . . , for Xj 
by conditioning on the orders of remaining m — 1 gene sets. The conditional likelihood for 
the j*'* order for Xj is given by 

I j- otherwise 

where 

Vj = P{{X,,Q, = el)\(X,QU). (5) 

For a fixed value of j, V- is computed by decomposing it into the product of conditional 
probability terms. For example, we compute the likelihood of ^ —)■?/—)■ a: corresponding 
to the gene set Xj = {x, y, z} as 

P((X„ e, = z^y^ x)|(X,e)_,) = P{z) X P{y\z) x P{x\y). (6) 

Each term on the right is conditioned on (X, and is available from 7r_j and n_j. We 
now sample an order for X,; from the conditi onal distribution using inverse Cumulative 



Density Function (CDF) (IGelman et a/.ll2003l ). The CDF of the conditional distribution 



P((Xi,ei)|(X,e)_i) is defined as 

F((x„e. = el)|(x,e)_.)) = ^pf (7) 



k=l 
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Algorithm 1 Gene Set Gibbs Sampler 

1: Input: m IFGS's Xj, i = 1, . . . ,m, prior knowledge (optional), burn-in state B and 

number of samples to be collected after burn-in state 
2: Output: m information flows (Xj, Bj), i = 1, . . . ,m 

3: At t = 0, make a random choice of order B^^'''' from LJ permutations, z = 1, . . . , m 

4: foTt = l,...,B + N do 

5: e = (B!*-^\...,B^^)f 

6: for i = 1, . . . , m do 

7: Compute p1*^ and U^l\ 

8: Calculate the conditional likelihoods £^'s (Eq. Hj) of permutations by treating 
as a first order Markov chain 

9: Draw an order of^ for Xj from the conditional distribution P((Xj, Bj)|(X, B)_j) 
10: Update the order information for Xj 
11: end for 
12: end for 

13: Return B^ = mode(Bf . . . , Bf +^^), i = l,...,m. 



for each j = 1, . . . , Li\. By sampling a number u ~ f/(0, 1) and letting F ^{u) = v, we get 
a randomly drawn order v for Xj from the conditional distribution (Eq. [7]). 

Gene Set Gibbs Sampler. In Algorithm [H we present Gene Set Gibbs Sampler, which 
leads to the reconstruction of signaling pathways from IFGS's derived in Stage I. In case 
of prior knowledge, we augment known edges as directed pairs with unordered IFGS's, and 
keep the direction of these edges fixed during the execution of the algorithm. Algorithm [1] 
outputs a list of IF's. To reconstruct signaling pathways, we start with an empty network 
of distinct genes present in the input list and reconstruct the most likely signaling pathway 
by joining IF's present in the output of Algorithm [TJ 



2.4. Burn-in state. A burn-in state in Algorithm [T] refers to a stage after which we start 
collecting sampled pathways. Samples collected after burn-in state are assumed to be 
drawn from the joint distribution of IF GS's. To determine an appropriate burn-in sta te, 
we translated the approach presented in (iGelman et a/.ll2003l . iGivens and Hoeting||2005[ ) in 
our framework to compute the ratio 



N- 



R 



N 



for three quantities sensitivity, specificity and PPV. Here, N is the total number of path- 
ways sampled after burn-in state, is the averaged within-chain variance and B^ is 
between-chain variance. Moreover, Sensitivity = TP/ (TP+FN), Specificity = TN/ (TN+FP) 
and PPV = TP/(TP+FP), where TP = number of true positives, TN = number true neg- 
atives, FP = number of false positives, and FN = number of false negatives. In practice 
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Algorithm 2 Network2GeneSets 



Input: A directed acyclic graph with n nodes 
Output: All IFGS's 
for i = 1 , . . . , n do 

if node i has no children then 

continue 
else 

if node i has children then 

add to Queue Q and the Linked List L all the directed pairs consisting of i and 
a child of i 
end if 

while Q is not empty do 

Pop an information flow P from Q 
if the last node in P, say k, has no children then 

continue 
end if 

add to Q and L, all information flows obtained by appending each child of k to 
P 

end while 
end if 

end for 

Prune information flows in L of length 2 (prior knowledge) 

Randomly permute orders of information flows in L and order of genes in each infor- 
mation flow 

21: Rotuiii all IFGS's of leiisitli > 3. 



if VR < 1.2, the choice of burn-in state and N is acceptable (also see Supplementary 
Material). 

2.5. Computational complexity. The worst case time complexity of Gene Set Gibbs 
Sampler is Nm{m + n + FL), where is the number of sampled pathways, m is the 
number of IFGS's, n is the number of distinct genes, L is the length of the longest gene 
set in the input and F — L\. As longer gene sets (L > 10) are less likely to correspond 
to information flows, the complexity arising from FL could be managed by appropriately 
selecting the length of gene sets in Stage I. Thus, the computational complexity of our 
algorithm increases quadratically with increase in the number of IFGS's, which compares 
very favorably with the Bayesian network approaches. 

3. Data Analysis 

We analyzed the performance of our proposed network inference framework by recon- 
structing three different gene regulatory networks. We obtained two gold standard directed 
networks from the In Silico Network Ghallenge in DREAM initiative. The two networks 
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are In Si lico network flMendes l2009l: IStolovitzkv e^ a/.l 120091) f rom pREAM2 and E. coU 
network flMarbach et a/.ll2009UMarbach et a/.ll2010llPrill et a/.ll2oToh from DREAMS net- 
work challenges. E. coli and In Silico networks consist of 50 nodes, with 62 and 37 true 
edges respectively. Availability of gold standard networks allows us to assess the perfor- 
mance of the proposed approach. In addition, we also implemented our two-stage GSGS 
framework to reconstruct signaling pathways in breast cancer cells. 

3.1. Derivation of IFGS's. From the E. coli and In Silico networks, two collections of 
IFGS's were derived by a direct application of Algorithm [2l Indeed, Algorithm [2] finds all 
unordered gene sets from a given network. The algorithm first finds all IF's (linear paths) 
in the network and then randomly permutes the ordering of genes in each IF. We may note 
that Algorithm [2] is more general than the standard Depth First Search (DFS) algorithm 
in that the latter does not find all the linear paths. There were a total of 125 and 57 
IFGS's of length > 3 for the E. coli and In Silico networks, respectively. These collections 
of IFGS's serve as input for Gene Set Gibbs Sampler (Algorithm [1]) . 

We also deri ve IFGS's using the C4 gene set compendium (computational gene sets) 
from MSigDB ( Subramanian et a/.l 20051 ). There are a total of 883 overlapping cancer gene 
sets and 10, 124 distinct genes in the compendium. We identified significant genes {P{X > 
x) > 0.95) by fitting a power law distribution on the degrees of 10, 124 genes (Fig. 6, 
Supplementary Material). We obtained a total of 289 genes using this selection procedure. 
We also collected 299 samples of breast cancer patients from Affymetrix HG-U133 plus 2.0 
platform. A total of 267 out of 289 selected genes could be mapped to the annotation table 
for the Affymetrix HG-U133 plus 2.0 platform. For each of the 267 genes, gene expression 
levels corresponding to exactly one probe set with highest average measurement among 
299 samples were selected. The resulting data set contained 267 rows (genes) and 299 
columns (samples). We performed SVD on the breast cancer gene expression data of size 
267 X 299 (m x n) and considered k basis vectors corresponding to k highest eigenvalues. 
As mentioned in Section 2, we chose k by setting k = max{r : r(m + n) < mn} = 141. 
To identify the most significant candidates for IFGS's, top 2% of the entries across k 
basis vectors were declared as non-zero and the remaining entries were set as zero. We 
derived a total of 138 candidate gene sets by identifying genes corresponding to non-zero 
entries among k basis vectors. We lost 3 gene sets by constraining a gene set to contain 
at least 3 genes. To measure the enrichment of gene sets, we further performed gene set 
enrichment analys is using the functional annotation tool in DAVID (IDennis et a/.l 12003 , 
Huang et a/.ll2009l ). DAVID performs gene set enrichment analysis using a modified Fisher 
Exact Test. We used Affymetrix Human Genome U133 Plus 2.0 Array as background to 
test the enrichment of gene sets. By setting the other parameters in DAVID as default, 
106 enriched gene sets containing a total of 212 distinct genes were derived. The enriched 
gene sets serve as IFGS's. 

3.2. Performance evaluation using E. coli network. We now analyze the perfor- 
mance of Gene Set Gibbs Sampler using E. coli network. Analogous results for In Silico 
network are presented as Supplementary Material. Using Gene Set Gibbs Sampler (Algo- 
rithm [1]), we collected a total of 500 networks after burn-in state which we fixed at 500. 



10 



(a) 



(c) 



(e) 



0% Prior Knowledge 



20% Prior Knowledge 



60 
40 
20 

0- 

60 
40 
20 

0- 

60 
40 
20 

0- 



20% 60% 100% 140% 180% 
40% Prior Knowledge 



20% 60% 100% 140% 180% 
80% Prior Knowledge 



(b) 



(d) 



(f) 



60 
40 
20 


60 
40 
20 

0- 

60 
40 
20 



20% 60% 100% 140% 180% 
60% Prior Knowledge 



20% 60% 100% 140% 180% 
100% Prior Knowledge 



20% 60% 100% 140% 180% 



20% 60% 100% 140% 180% 



Figure 2. Sensitivity analysis for the GSGS approach with increasing percent- 
age of prior knowledge. Network: E. coli. In blocks (a)-(f), x-axis represents the 
percentage of gene sets present in the input and y-axis plots the total number 
of edges predicted by GSGS (solid line). The dashed line plots correspond to 
the ground truth. Here, we have considered only those genes which were present 
among IFGS's after pruning all gene pairs. 

As all gene pairs are pruned by Algorithm [2l some genes might be lost and never appear 
in the input list of IFGS's. We compare the network predicted by Algorithm [1] with the 
subnetwork formed by genes present in the input. A detailed list of settings is presented 
in the Supplementary Material. With the chosen set of parameters, \/R in Eq. [8] was 
found approximately equal to one, for each of the three quantities sensitivity, specificity 
and PPV. We used the total number of predicted true edges and F-score to assess the 
performance of Algorithm [H The F-score is defined as F = 2pr/{p + r). Here, r is the 
sensitivity and p is the PPV. 

In order to accommodate the real-world under-sampling and over-sampling situations, 
we first performed sensitivity analysis of the GSGS approach using E. coli network. Fig. 
[2] demonstrates the effect of removing and adding unordered gene sets to the input list of 
IFGS's in Algorithm [TJ In Fig. [21 x-axis represents the percentage of gene sets present 
in the input list, where 20% means that 80% of the gene sets were randomly removed 
from the list of all IFGS's, and 120% means that 20% of randomly sampled gene sets 
were added to the original list of all IFGS's. In Fig. |2l we present the performance of 
our approach in terms of the total number of predicted true edges. In blocks (a)-(f), the 
number of edges identified by the GSGS approach remains close to the ground truth. We 
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Table 1 . F-scores calculated for the GSGS approach with increasing percentage 
of gene sets in the input (row) and prior knowledge (column). Network: E. coli. 
We observe a clear increasing trend in the F-scores in each row, indicating the 
positive impact of incorporating prior knowledge, while a clear trend of similarity 
is observed within each column, indicating a marked robustness of the performance 
of GSGS to the over-sampling and under-sampling of gene sets. 



also observe the positive effect of incorporating prior knowledge. As the percentage of 
prior knowledge increases (block (a) to block (f)), difference between the ground truth and 
prediction decreases. In particular, our approach does not produce a large number of false 
positives in the presence of redundant gene sets. 

In Table [H we present the F-scores for the GSGS approach with increasing percentage 
of gene sets (rows) and prior knowledge (columns). We observe that the F-scores increase 
with an increase in the percentage of prior knowledge (values in a row), and these scores 
remain close on removal or addition of gene sets (values in a column) demonstrating an 
impressive robustness to under-sampling and over-sampling. This observation strongly 
supports the applicability of our GSGS framework in the real-world scenarios, where we 
often do not observe all gene sets or the observed gene sets are redundant. 

We also compare t he performance of our approach with a number of popular network 
inference approaches (iMargolin et al. I l2006l . iMeyer et a/.ll2008l ) with a primary emphasis on 
the two Bayesian network approaches, K2 and MCMC (M etropolis-Hastings or MH) imple- 
mented in the Bayes Net Tool Box (BNT) flMurphvll20dll b. http : / / sourcef orge . net/proj ects/bnt/f ilei 
The main reasons are the following: (1). From methodology point of view our method infers 
the most probable linear structure(s) using likelihood scores calculated from the products 
of conditional probabilities. It is essentially in the same sprit as Bayesian network ap- 
proaches while fundamentally different from other approaches based on the calculation of 
pair-wise similarity. (2). Both our approach and Bayesian network approaches naturally 
take discrete data in that a collection of gene sets can equivalently be represented as a 
matrix of binary discrete values. Indeed, each IFGS naturally corresponds to a binary 
sample derived by considering the presence and absence of a gene in the set. Most of the 
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IFGS's 



Gene Set Gibbs Sampler 
(Directed Networks) 



Figure 3. A sketch of the idea behind comparing the GSGS approach with 
Bayesian network approaches. Note that the underlying network from which gene 
sets are derived is a directed network. Moreover, gene sets can equivalently be 
represented as a matrix of binary discrete values. Bayesian networks are the best 
choice in this case to fairly assess the performance of GSGS. Bayesian network 
approaches accommodate both discrete and continuous data, and reconstruct a 
directed network. 



existing network reconstruction algorithms are more suitable for inferring an undirected 
network from continuous data sets. 

In Fig. |3l we sketch the idea behind comparing our approach with the Bayesian network 
approaches. Our goal in this paper is to infer the underlying directed network. Also note 
that a collection of gene sets can be represented as a matrix of binary discrete values. A 
binary sample corresponding to an IFGS can be derived by assigning a value to the genes 
not present in the IFGS and 1 otherwise. Bayesian network approaches can accommodate 
both discrete and continuous data sets and reconstruct a directed network. The equivalent 
representation of gene sets as binary discrete data makes the comparison between our gene 
set based approach and the Bayesian network approaches very fair. In addition, we also 
generated continuous data to serve as inp ut for the Bayesian network and other approaches 
(iMargolin et"aZll2006l iMever et a/.ll2008[ ). Thus, using the same underlying network, 
e.g. the E. coli network, as the sole input (Fig. [3]): (1). We generate discrete data inputs 
for Gene Set Gibbs Sampler (Algorithm [Tj) by collecting IFGS's in the output of Algorithm. 
(2). We generate discrete data inputs for K2 and MH by considering the absence (0) or 
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Figure 4. Network: E. coli. (Upper Panel) Comparison of the GSGS approach 
with K2 and MH in terms of Total Number of Predicted Edges with increasing 
percentage of prior knowledge. In each panel "Method-N" stands for a Bayesian 
network method applied to continuous data of sample size N, and "Method-DIS" 
corresponds to using binary discrete data. Bayesian Information Criterion (BIC) 
and Bayesian scoring were used on the corresponding data sets. The dashed line 
represents ground truth. (Lower Panel) Comparison of the GSGS approach with 
K2 and MH in terms of F-score. Here x-axis represents the percentage of prior 
knowledge and y-axis plots F-scores from three approaches. 



presence (1) of a gene in each IFGS in the output of Algorithm [2l (3). We generate 
continuous data inputs for K2, MH and MINET using BNT. 
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Figure 5. A proof of principle study. Left panels show two gold standard net- 
works, E. coli (Upper) and In Silico (Lower); Right panels show the corresponding 
predicted networks by GSGS, E. coli (Upper) and In Silico (Lower). For a fair com- 
parison, all stand-alone linear paths of length 2 are removed from both networks. 
On the right panels, the blue edges correspond to true positives and gray edges 



represent false positives. Figures were generated using Cytoscape (jShannon et al. 
20031 1. 



In principle, the K2 approach ( Cooper and Herskovitd 1992 ) first specifies an ordering 
of nodes involved in the underlying network. Thus, initially each node has no parent. The 
algorithm incrementally assigns a parent to a node whose addition increases the score of 
the resulting structure the most. For the i^^ node, parents are chose n from the set of nodes 
with index 1, ... ,i — 1. On the other hand, the MH algorithm flMurphvl [200ll a) starts 
from an initial directed acyclic network Gq and selects a network Gi uniformly from the 
neighborhood of Gq. The neighborhood of a network G is the collection of all directed 
acyclic networks which differ from G by addition, deletion or reversal of a single edge. 
The algorithm accepts or rejects the move from Gq to Gi by computing an acceptance 
ratio defined in terms of marginal likelihood ratio P{D\Gi) / P{D\Go) . Here D represents 
the given data. This procedure is iterated starting from the most recent network. A 
specified number of networks are collected aft er burn-in state . For scoring a structure, BNT 
i mplements Bayesian Informa tion Criterion (jSchwartall978l ) and Bayesian score functions 
(ICooper and Herskovitd Il992l ). 
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GSGS 


CLR 


ARACNE 


MRNET 


MRNETB 


E. coli 


0.457 


0.230 


0.377 


0.303 


0.228 


In Silico 


0.431 


0.238 


0.425 


0.389 


0.327 



Table 2. Performance comparison of GSGS with four other pair- wise similarity 
based network reconstruction approaches using F-scores. The sample size is 50. 



In the upper panel of Fig. HJ we plot the results from a comparative study in terms 
of total number of predicted edges. It is clear that K2 and MH predict many false posi- 
tives. In the lower panel of Fig. HI we have plotted the F-scores for different approaches 
with increasing percentage of prior knowledge. We observe that F-scores for the GSGS 
approach is significantly higher than K2 and MH. Further, the impact of incorporating 
prior knowledge on F-score is more prominent in case of GSGS than K2 and MH. F-scores 
for both K2 and MH remain much lower than the GSGS approach even in the presence of 
a large amount of prior knowledge. For similar results using In Silico network, we refer to 
the Supplementary Material. We also compare GSGS with four other approaches without 
using prior knowledge. The F-score results are presented in Table [2J In Figure El we 
provide more detailed evidence of the superior performance of our method using both In 
Silico and E coli networks. In Figure [5], two left panels represent the true topologies of 
both networks, and two right panels represent the reconstructed network topologies using 
GSGS. In each reconstructed network, blue edges represent true positives and gray edges 
represent false positives. A high level of accuracy is observed in both the reconstructed 
networks. 



3.3. Pathway Reconstruction in Breast Cancer Cells. Before using the IFGS's for 
signaling pathway reconstruction, we validated our underlying assumption that a large 
network is built from unordered and overlapping IFGS's. We measured the amount of 
overlapping among IFGS's. Indeed, we computed the number of genes shared by different 
number of gene sets (Fig. 7, Supplementary Material). A minimum of 75% of total genes 
were found to be shared by at least two IFGS's. An exponentially truncated power law 
distribution {y oc x'^e''^^) was fitted on the degre es of genes (Fig. 8, Su pplementary 
Material). Such networks naturally occur in biology ( Ghazalpour et al. 20061 ). 

A total of 20 candidate signaling pathways from 20 independent runs of Algorithm 
[1] were predicted. To summarize a single network, we declared all the edges appear- 
ing in at least 5 networks as true edges, for a fair compromise between sensitivity and 
PPV (Fig 9, Supplementary Material). In Fig. [HI we present a subnetwork formed by 
nodes with at least 5 first order neighbors in the reconstructed network. Indeed, nodes 
with high connectivity are li kely to participate in many signaling transduction events. 
We made use of GeneCards (jSafran et all 120101 ) to verify the relevance of genes in the 
subnetwork with breast cancer and signaling events. We found that many genes, e.g. 
BMPIO, CCL2, CCRl, C0L19A1, CXCR4, EPHB2, FLTl, FOS, GNG4, ITGB5 and 
MDM2 shown in Fig. [6l are involved in the molecular mechanisms of cancer. In addition. 
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Figure 6. A partial view of the subnetwork formed by nodes with a minimum 
of five first order neighbors in the network reconstructe d from the genes rela ted 
to breast cancer. Figure was generated using Cytoscape ( Shannon et al. 20031 ). 



MDM2 is involved in HER-2 signaling in breast cancer, P0LR2I in hereditary signaling 
in breast cancer and ATF2 in Estrogen-dependent breast cancer signalingfS igma-Aldrich 
www. sigmaaldrich. com). CXCR4 is highly expressed in breast cancer cells ( Muller et al\ 



20011 . RefSeq www. ncbi .nlm.nih.gov/ref seq) whereas GJAl is marker for detecting early 
oncogenesis in the breast (Genatlas http : //genatlas .medecine .univ-paris5 . f r ). RRMl 
is located in the imprinted gene domain of llpl5.5 (an important tumor- suppressor gene 
region). Alterations in this region are associated with breast cancer (RefSeq). ATF2 
and its two direct neighbors WAS and ITGB5 participate in CDC42 pathway (Applied 
Biosystems Pathway www.appliedbiosystems.com). Similarly, BMPIO and HMGNl are 
involved in ERK signaling, and EPHB2 and KCNA5 in PI3K signaling. Genes appearing 
in the directed path from FLTl to EPHB2 via BMPIO and ATF2, and genes in the path 
from GNG4 to EPHB2 via BMPIO and ATF2, are highly relevant to MAPK signaling 
and P38 signaling. For example, BMPIO is connected to ATF2 by a linear path. It has 
been reported that TAKl and t he SMAD pathways activated by BMPs activate several 
transcription factors like ATF2 ( Monzen et al\ 200 1| ). Similarly, FLTl and GNG4 which 
are closely situated and connected by a linear path, have been reported to participate in 
many signaling events, e.g. ERK signaling, PI3K Signaling, P38 signaling and MAPK sig- 
nahng. These evidences further support the use of GSGS framework for signaling pathway 
reconstruction. 



4. Conclusion 



In this paper, we proposed a novel computational framework, GSGS, to reconstruct sig- 
naling pathways from gene sets. As far as we know, the proposed framework is original in 



17 



the following aspects: (1). It offers a unique two-stage framework for network reconstruc- 
tion by combining knowledge from existing gene sets and molecular profiling data from 
high-throughput platforms (2). The ordering of genes in each gene set is treated as a ran- 
dom variable to capture the higher order interactions among genes participating in signal 
transduction events. In most of the existing approaches, individual genes are treated as 
variables (3). The problem of signaling pathway reconstruction is cast into the framework 
of parameter estimation for a multivariate distribution. (4). The true signaling pathways 
are modeled as a probability distribution of sample signaling pathways. 

We first assessed the performance of our network inference algorithm by using two gold 
standard networks: E.coli and In Silico. Our approach was shown to have significantly 
better performance in terms of F-score and total number of pred icted edges than the 
Bayesian network and other pairwise similarity based approaches ( Margolin et al. 20061 . 
Meyer et al\ 20081 ) . Robustness of our approach against under-sampling or over-sampling of 



gene sets was proved by performing sensitivity analysis. We applied our GSGS framework 
to reconstruct a network in breast cancer cells, and verified it using existing database 
knowledge. Overall, our analyses favor the use of our two-stage GSGS framework in the 
inference of complicated signaling pathways. 

The advent of systems biology has been accompanied by the blooming of network con- 
struction algorithms, many of which treat gene pairs as the basic building block of the 
signaling pathways and reconstruct signaling pathwa ys by simultaneously detecting co- 
expressed gene pairs using molecular profiling d ata (e.g. iButte and KohandEooa! .! IZhu et al\ 
2005 . Margolin et al. 20061 . Meyer et al. 20081 ). This type of approaches enjoy simplicity 



and a much alleviated computational load but gene pairs do not represent the entire signal 
transduction pathways. Other approaches heuris tically search for the higher scored net 
work structure(s), such as Bayesian networks (e.g. Cooper and Herskovitd IQOl . Song et al. 



20091 ). Many network structures may be found to be statistically plausible, but similar to 



the gene pairs they do not necessarily represent the real signaling transduction mechanisms. 
Moreover, the computation loads of searching for a higher scored network is prohibitively 
high and a number of assumptions on the network structures have to be made, such as small 
size of the parent sets. Our GSGS framework infers the most likely signaling pathway(s) 
from a probability distribution of sampled signaling pathways using overlapping gene sets 
inferred from molecular profiling data. The reconstructed information fiows are faithful 
representation of the real-world signaling transduction mechanisms. The advantages of 
gene set based computational approaches have been adequately demonstrated in the many 
bioinformatics research areas, for example, disease classification and enrichment analysis, 
we expect our gene set based GSGS framework to open a new avenue in methodology 
research of signal transduction. 
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